Thesis Project

Evaluating the efficiency of Twitter Sentiment Analysis as a tool of prediction for the stock market.

Aim

The aim of my thesis project was to build a time series of the polarity of tweets related to a cluster of firms, and to compare it with the time series of each firm's behaviour in the stock market. The chosen firms were: Apple, Google, Nike, Nestlé, Beyond Meat, Bayer and NovaVax.
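As a rough illustration of what building and comparing the two series involves, the sketch below collapses tweet-level scores into one polarity value per day and lines it up with daily closing prices. It uses base R only; the data frames, column names and values are toy assumptions, not the thesis data, and the daily mean is just one possible way to aggregate.

```r
# Hypothetical tweet-level scores: one row per tweet (toy values).
tweets <- data.frame(
  created_at = as.POSIXct(c("2021-03-01 09:15:00", "2021-03-01 17:40:00",
                            "2021-03-02 11:05:00", "2021-03-03 08:30:00",
                            "2021-03-03 20:10:00"), tz = "UTC"),
  score = c(1, -1, 2, 0, 1)
)

# Hypothetical daily closing prices (toy values).
prices <- data.frame(
  date  = as.Date(c("2021-03-01", "2021-03-02", "2021-03-03")),
  close = c(127.8, 125.1, 122.1)
)

# Collapse the tweets to one mean polarity value per calendar day.
tweets$date <- as.Date(tweets$created_at, tz = "UTC")
daily_polarity <- aggregate(score ~ date, data = tweets, FUN = mean)
names(daily_polarity)[2] <- "polarity"

# Keep only the days present in both series, ordered by date.
aligned <- merge(daily_polarity, prices, by = "date")
aligned <- aligned[order(aligned$date), ]
print(aligned)
```

The result has one row per day with a `polarity` and a `close` column, which is the shape a time series comparison needs.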

Data Acquisition

To download all the tweets related to the firms I used Twitter's API. I wrote all the code in R and automated it with Windows Task Scheduler, so that the download would start by itself every day at the same hour. The downloaded tweets were then automatically uploaded to OneDrive, so that I could access them at any time. I also implemented an automated Gmail notification that told me the download had completed correctly and sent me some general statistics.
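The part of such a daily job that runs after the download could look roughly like the sketch below: drop duplicate tweets, write the day's file under a dated name, and build a one-line summary for the notification. The helper names (`daily_file`, `save_daily`), the column names and the summary contents are all hypothetical. The Twitter API call, the OneDrive upload and the e-mail step are left out.

```r
# Hypothetical helper: a dated file name, so every daily run writes its own file.
daily_file <- function(firm, day = Sys.Date(), dir = tempdir()) {
  file.path(dir, sprintf("%s_%s.csv", firm, format(day, "%Y-%m-%d")))
}

# Hypothetical helper: de-duplicate, save, and return a summary line
# that a notification message could include.
save_daily <- function(tweets, firm, day = Sys.Date(), dir = tempdir()) {
  unique_tweets <- tweets[!duplicated(tweets$id), ]
  path <- daily_file(firm, day, dir)
  write.csv(unique_tweets, path, row.names = FALSE)
  list(
    path = path,
    summary = sprintf("%s: %d tweets saved (%d duplicates dropped), %d distinct users",
                      firm, nrow(unique_tweets),
                      nrow(tweets) - nrow(unique_tweets),
                      length(unique(unique_tweets$user)))
  )
}

# Toy download with one duplicated tweet.
tweets <- data.frame(
  id   = c("1", "2", "2", "3"),
  user = c("a", "b", "b", "a"),
  text = c("$AAPL up", "$AAPL down", "$AAPL down", "holding $AAPL"),
  stringsAsFactors = FALSE
)
result <- save_daily(tweets, "AAPL", day = as.Date("2021-03-01"))
print(result$summary)
```

Saved as a script, something like this can be run unattended by a scheduled task that calls `Rscript` on it once a day.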

Sentiment Analysis

I then proceeded to clean all the data, lemmatize it, and analyze it. To calculate a sentiment score I used three methods:
  1. Naive Bayes
    Based on Bayes' theorem, the algorithm classifies every tweet as "positive" or "negative" using the "MPQA Subjectivity Lexicon" by Janyce Wiebe.
  2. Syuzhet
    Uses the syuzhet package and its dictionary of the same name to give a score to each tweet.
  3. Udpipe
    Uses the udpipe package (with the MPQA subjectivity lexicon) to give a score to each tweet. It can also use intensifiers, weakeners and modifiers (so that it can, for example, distinguish between "good", "very good", "quite good" and "not good").
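To make the idea of intensifiers, weakeners and negating modifiers concrete, here is a minimal base R sketch of a lexicon scorer that looks at the word just before each polarity term. It does not call the udpipe package. The tiny lexicon, the word lists, the `boost` weight and the function name `score_tweet` are assumptions chosen for illustration, not the MPQA lexicon or the thesis settings.

```r
# Toy lexicon and modifier lists (hypothetical values).
polarity     <- c(good = 1, great = 1, bad = -1, awful = -1)
intensifiers <- c("very", "really")
weakeners    <- c("quite", "slightly")
negators     <- c("not", "never")

score_tweet <- function(text, boost = 0.5) {
  # Minimal cleaning: lower-case, keep letters only, split on spaces.
  words <- strsplit(gsub("[^a-z ]", " ", tolower(text)), " +")[[1]]
  words <- words[words != ""]
  total <- 0
  for (i in seq_along(words)) {
    base <- polarity[words[i]]
    if (is.na(base)) next                 # not a polarity term
    value <- unname(base)
    prev <- if (i > 1) words[i - 1] else ""
    if (prev %in% intensifiers) value <- value * (1 + boost)  # "very good"
    if (prev %in% weakeners)    value <- value * (1 - boost)  # "quite good"
    if (prev %in% negators)     value <- -value               # "not good"
    total <- total + value
  }
  total
}

sapply(c("good", "very good", "quite good", "not good"), score_tweet)
```

With these toy weights the four phrases score 1, 1.5, 0.5 and -1 respectively, so the scorer separates them where a plain word count would not.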

Conclusion

To evaluate whether a causal relationship exists between the tweets and the closing price of each firm, I built a test based on the Granger causality test, which I called the "Close Test". This test gave very positive results, highlighting numerous causal relationships. The test score also showed that, in our cluster of firms, the tweets that refer only to the firm's value in the stock market (e.g. containing $AAPL, $GOOGL, etc.; dataset "stock") are better suited to short-term forecasting (the closing price of the same day), while the tweets that refer to the company in general (e.g. also containing Apple, Google, etc.; dataset "score") are better suited to longer-term forecasting (the value in the following days).
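For readers unfamiliar with the Granger causality test, the sketch below shows its core idea in base R: fit a model of a series on its own past, fit a second model that also includes the past of the other series, and use an F-test to ask whether the extra lags help. This is not the "Close Test" itself, whose construction is not described here. The function name `granger_f` and the simulated data are assumptions, and the `lmtest` package offers a ready-made `grangertest` function for real use.

```r
# Granger-style F-test: does the past of x improve a forecast of y
# beyond what y's own past already provides?
granger_f <- function(x, y, lag = 1) {
  n <- length(y)
  stopifnot(length(x) == n, n > 3 * lag + 1)
  target <- y[(lag + 1):n]
  # Column k holds the series shifted back by k steps.
  y_lags <- sapply(1:lag, function(k) y[(lag + 1 - k):(n - k)])
  x_lags <- sapply(1:lag, function(k) x[(lag + 1 - k):(n - k)])
  restricted   <- lm(target ~ y_lags)            # own past only
  unrestricted <- lm(target ~ y_lags + x_lags)   # own past plus past of x
  comparison <- anova(restricted, unrestricted)
  c(F = comparison$F[2], p = comparison$`Pr(>F)`[2])
}

# Simulated toy data in which yesterday's sentiment drives today's price change.
set.seed(1)
n <- 200
sentiment    <- rnorm(n)
price_change <- c(0, 0.8 * sentiment[-n]) + rnorm(n, sd = 0.5)

forward <- granger_f(sentiment, price_change, lag = 1)  # sentiment -> price
reverse <- granger_f(price_change, sentiment, lag = 1)  # price -> sentiment
print(rbind(forward, reverse))
```

A small p-value in the forward direction says that lagged sentiment carries predictive information about the price series. The toy example uses price changes rather than price levels because the test is usually applied to series that are roughly stationary.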

For a deeper look at the visualizations and conclusions of the project, download the final report or view the corresponding repository on GitHub.

Tags

R, API, Statistics, Automated Tasks, Time Series Analysis, Sentiment Analysis, udpipe, Text Analysis